
Introduction via a simple version

n_organism <- 10
baseline_indices <- 2:n_organism
exponential_indices <- 1
n_bp <- 100
k <- 40
n_kmer <- n_bp - k
time_window <- 14
baseline <- 100
exp_initial <- 1
r <- 1
read_depth <- 10^4

Suppose that there are 10 organisms in total in the sample, and nothing else. Each organism has a genome that is exactly 100 base-pairs long. Each genome is broken into \(k\)-mers of length 40. We take there to be 60 \(k\)-mers from each organism, obtained by subtracting the \(k\)-mer length from the genome size (strictly, a 100 base-pair genome contains \(100 - 40 + 1 = 61\) overlapping 40-mers; we drop the extra one for convenience). Assume that each \(k\)-mer is unique, and that there are no sequencing errors. We collect data over a period of 14 days, \(t = 0, 1, \dots, 13\), at a single environmental monitoring site.

Organism-level baseline and exponential regimes

We consider two regimes: 1) baseline and 2) exponential growth. In the baseline regime, a species has a concentration \(c^{(b)}\) in water of 100 copies per \(\mu\)L¹, independent of the day. \[ c^{(b)}_t = c^{(b)} \] In the exponential growth regime, a species starts at a concentration \(c^{(e)}_0\), and over the 14-day window its concentration increases exponentially according to \[ c^{(e)}_t = c^{(e)}_0 \exp(rt) \] where \(r\) is the growth rate, which we take to be 1, and \(c^{(e)}_0\) is the initial exponential concentration, which we take to be 1 – lower than the baseline concentration, as we assume the exponentially increasing organism to be novel.

#' The concentration of the baseline organisms over the time window
conc_baseline <- rep(baseline, time_window)
conc_baseline
##  [1] 100 100 100 100 100 100 100 100 100 100 100 100 100 100
#' The concentration of the exponentially increasing organism over the time window
conc_exponential <- exp_initial * exp(r * (0:(time_window - 1)))
conc_exponential
##  [1] 1.000000e+00 2.718282e+00 7.389056e+00 2.008554e+01 5.459815e+01 1.484132e+02 4.034288e+02 1.096633e+03 2.980958e+03
## [10] 8.103084e+03 2.202647e+04 5.987414e+04 1.627548e+05 4.424134e+05

Suppose that organisms 2, 3, 4, 5, 6, 7, 8, 9, 10 follow the baseline behaviour and organism 1 follows the exponential behaviour. We will represent the true concentrations of each organism with a matrix C as follows:

C <- matrix(
  data = NA,
  nrow = time_window,
  ncol = n_organism
)

C[, exponential_indices] <- conc_exponential
C[, baseline_indices] <- conc_baseline
C
##               [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] 1.000000e+00  100  100  100  100  100  100  100  100   100
##  [2,] 2.718282e+00  100  100  100  100  100  100  100  100   100
##  [3,] 7.389056e+00  100  100  100  100  100  100  100  100   100
##  [4,] 2.008554e+01  100  100  100  100  100  100  100  100   100
##  [5,] 5.459815e+01  100  100  100  100  100  100  100  100   100
##  [6,] 1.484132e+02  100  100  100  100  100  100  100  100   100
##  [7,] 4.034288e+02  100  100  100  100  100  100  100  100   100
##  [8,] 1.096633e+03  100  100  100  100  100  100  100  100   100
##  [9,] 2.980958e+03  100  100  100  100  100  100  100  100   100
## [10,] 8.103084e+03  100  100  100  100  100  100  100  100   100
## [11,] 2.202647e+04  100  100  100  100  100  100  100  100   100
## [12,] 5.987414e+04  100  100  100  100  100  100  100  100   100
## [13,] 1.627548e+05  100  100  100  100  100  100  100  100   100
## [14,] 4.424134e+05  100  100  100  100  100  100  100  100   100

Sampling \(k\)-mers

We assume that the true concentration of each \(k\)-mer is that of the corresponding organism. This may be represented by copying each column of the matrix C n_kmer = 60 times.

K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)

The matrix K now has 14 rows, one for each day, and 600 columns, one for each \(k\)-mer (60 per organism across 10 organisms). We can calculate proportions by dividing each row by its sum:

K_norm <- t(apply(K, 1, function(x) x / sum(x), simplify = TRUE))
# useful::topleft(K_norm)
# useful::bottomleft(K_norm)
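As a quick sanity check, the construction above can be verified end-to-end (a self-contained sketch repeating the definitions from earlier):

```r
# Parameters as defined at the top of the document
n_organism <- 10; n_bp <- 100; k <- 40
n_kmer <- n_bp - k        # 60
time_window <- 14
baseline <- 100; exp_initial <- 1; r <- 1

# Concentration matrix C: days in rows, organisms in columns
C <- matrix(NA, nrow = time_window, ncol = n_organism)
C[, 1] <- exp_initial * exp(r * (0:(time_window - 1)))
C[, 2:n_organism] <- rep(baseline, time_window)

# K copies each organism's concentration once per k-mer
K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)
K_norm <- t(apply(K, 1, function(x) x / sum(x)))

dim(K_norm)                             # 14 days by 600 k-mers
all(abs(rowSums(K_norm) - 1) < 1e-12)   # each row is a probability vector
```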

One way to represent the sequencing process is as a sample from the collection of \(k\)-mers. For example, we could consider a multinomial sample with probabilities of sampling each \(k\)-mer given by K_norm and sample size given by the read depth \(10^4\).

To demonstrate this, suppose we simulate the sequencing process on day 1. The proportions of each \(k\)-mer are given by K_norm[1, ], and we may sample using rmultinom. A histogram of the \(k\)-mer counts at day 1 under each regime, showing that the exponential regime is initialised at low count numbers, is given by:

sample_one <- rmultinom(1, read_depth, K_norm[1, ])

get_regime <- function(organism_index) {
  if (organism_index %in% baseline_indices) return("baseline")
  if (organism_index %in% exponential_indices) return("exponential")
  NA_character_
}

#' Testing that this function works as intended
purrr::map_chr(1:11, get_regime)
##  [1] "exponential" "baseline"    "baseline"    "baseline"    "baseline"    "baseline"    "baseline"    "baseline"   
##  [9] "baseline"    "baseline"    NA
data.frame(count = sample_one) %>%
  mutate(
    organism = rep(1:n_organism, each = n_kmer),
    regime = str_to_title(purrr::map_chr(organism, get_regime))
  ) %>%
  ggplot(aes(x = count, group = regime, fill = regime)) +
    geom_histogram(alpha = 0.8) +
    labs(x = "k-mer count", y = "Occurrences", fill = "Regime", title = "k-mer counts at day 1") +
    scale_fill_manual(values = cbpalette) +
    theme_minimal()

Now, we will take multinomial samples from all of the days with a call to apply:

sample <- apply(K_norm, 1, function(row) rmultinom(1, read_depth, row))

colnames(sample) <- paste0("day", 1:ncol(sample))
rownames(sample) <- paste0(1:nrow(sample))

sample_df <- sample %>%
  as.data.frame() %>%
  tibble::rownames_to_column("id") %>%
  pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    values_to = "count",
    names_prefix = "day"
  ) %>%
  mutate(
    id = as.numeric(id),
    day = as.numeric(day),
    kmer = rep(rep(1:n_kmer, each = time_window), times = n_organism),
    organism = rep(1:n_organism, each = n_kmer * time_window)
  )

As a check of the indexing, sample_df can be filtered to the first \(k\)-mer from the first organism, giving one count per day (14 rows).
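The filter itself is straightforward; a minimal sketch (using a toy stand-in for sample_df, since only the column structure matters here):

```r
# Toy stand-in with the same columns as sample_df in the text
sample_df <- data.frame(
  id = rep(1:4, each = 3),
  day = rep(1:3, times = 4),
  count = c(5, 4, 6, 3, 7, 2, 100, 90, 110, 95, 105, 98),
  kmer = rep(c(1, 2, 1, 2), each = 3),
  organism = rep(c(1, 1, 2, 2), each = 3)
)

# Equivalent to dplyr::filter(sample_df, organism == 1, kmer == 1)
first_kmer <- subset(sample_df, organism == 1 & kmer == 1)
first_kmer  # one row per day for k-mer 1 of organism 1
```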

Let’s plot the data from organism 1, which we have set to be exponentially increasing:

sample_summary <- sample_df %>%
  filter(organism %in% exponential_indices) %>%
  group_by(day) %>%
  summarise(
    count_upper = quantile(count, 0.95),
    count_median = median(count),
    count_lower = quantile(count, 0.05)
  )

ggplot(sample_summary, aes(x = day, ymin = count_lower, y = count_median, ymax = count_upper)) +
    geom_ribbon(alpha = 0.1) +
    geom_point() +
    geom_point(data = filter(sample_df, organism == 1), aes(x = day, y = count, col = kmer),
               alpha = 0.1, inherit.aes = FALSE) + 
    theme_minimal() +
    labs(x = "Day", y = "Number of reads in sample", col = "k-mer")

pickout_day <- 10

With these settings, on day 10 the median count of organism 1 is 150.5. At this stage, the exponential growth curve has started to level off because \(k\)-mers from organism 1 already represent 90.3% of the total reads. This is unrealistic: it would be unlikely – and, for the novel pathogens we are considering, essentially impossible – for any one organism to saturate the sample in this way.
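Since the concentrations in C are deterministic, the day-10 saturation can be computed directly (a sketch using the parameter values above; the 90.3% quoted in the text is the read-based version of the same quantity, so the two differ slightly due to sampling noise):

```r
n_organism <- 10; n_kmer <- 60
baseline <- 100; exp_initial <- 1; r <- 1
pickout_day <- 10

# Day 10 corresponds to t = 9, since day 1 is t = 0
conc_org1 <- exp_initial * exp(r * (pickout_day - 1))

# Share of total concentration (hence of expected reads) from organism 1
share <- (n_kmer * conc_org1) /
  (n_kmer * conc_org1 + (n_organism - 1) * n_kmer * baseline)
round(share, 3)  # 0.9
```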

Now, we will start to make this simulation more realistic.

Adding noise to the true abundances

In reality, organism abundances will vary, neither staying fixed at a constant nor increasing deterministically. We now suppose that the abundances vary stochastically as follows.

Baseline

For the baseline concentration, we will use a lognormal geometric random walk such that \[ c^{(b)}_t = c^{(b)}_{t - 1} \times \exp \left( \epsilon^{(b)}_t \right) \] where \(\epsilon^{(b)}_t \sim \mathcal{N}(0, \sigma_b^2)\), so that each multiplicative innovation \(\exp(\epsilon^{(b)}_t)\) is lognormal. Another way to calculate the baseline concentration at time \(t\), \(c^{(b)}_t\), is as \[ c^{(b)}_t = c^{(b)}_{0} \times \exp \left( \sum_{s = 1}^t \epsilon^{(b)}_s \right) \] Note that the expected value of the cumulative sum of innovations is \[ \mathbb{E} \left[ \sum_{s = 1}^t \epsilon^{(b)}_s \right] = \sum_{s = 1}^t \mathbb{E} \left[ \epsilon^{(b)}_s \right] = 0, \] and the variance is \[ \mathbb{Var} \left[ \sum_{s = 1}^t \epsilon^{(b)}_s \right] = \sum_{s = 1}^t \mathbb{Var} \left[ \epsilon^{(b)}_s \right] = t\sigma_b^2. \]

We can use the formula for the moment generating function of a Gaussian random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\) \[ m_X(u) = \mathbb{E} \left[ e^{uX} \right] = \exp(\mu u + \sigma^2 u^2 / 2) \] to see that the expected concentration at time \(t\) under this model is \[ \mathbb{E} \left[ c^{(b)}_t \right] = c^{(b)}_{0} \exp(t \sigma_b^2 / 2) > c^{(b)}_{0}, \] since the cumulative sum of innovations is distributed as \(\mathcal{N}(0, t\sigma_b^2)\).
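The MGF-based mean \(c^{(b)}_{0} \exp(t \sigma_b^2 / 2)\) can be checked by Monte Carlo; a sketch, assuming \(\sigma_b = 1\) and checking at \(t = 3\):

```r
set.seed(1)

baseline <- 100   # c_0
sigma_b <- 1      # innovation standard deviation (an assumption for this check)
t_check <- 3      # time point at which to compare
n_sim <- 1e5      # number of simulated paths

# Cumulative sum of t innovations for each path
cumsum_eps <- colSums(matrix(rnorm(t_check * n_sim, sd = sigma_b), nrow = t_check))
mc_mean <- mean(baseline * exp(cumsum_eps))

# MGF-based expected concentration: c_0 * exp(t * sigma_b^2 / 2)
theory <- baseline * exp(t_check * sigma_b^2 / 2)
c(monte_carlo = mc_mean, theory = theory)  # theory is 100 * exp(1.5), about 448
```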

n_baseline_paths <- 8

To show how this behaviour looks, let’s simulate 8 baseline paths.

baseline_df <- data.frame(
  day = rep(1:time_window, times = n_baseline_paths), 
  epsilon = rnorm(time_window * n_baseline_paths),
  path_number = as.factor(rep(1:n_baseline_paths, each = time_window))
  ) %>%
  group_by(path_number) %>%
  arrange(path_number) %>%
  mutate(
    cumsum_epsilon = cumsum(epsilon),
    conc = baseline * exp(cumsum_epsilon)
  ) %>%
  ungroup()

baseline_df %>%
  pivot_longer(
    cols = c("epsilon", "cumsum_epsilon", "conc"),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable = recode_factor(variable, 
      "epsilon" = "epsilon", 
      "cumsum_epsilon" = "cumsum(epsilon)",
      "conc" = "Concentration")
  ) %>%
  ggplot(aes(x = day, y = value, group = path_number, col = path_number)) +
    geom_line() +
    facet_wrap(~variable, ncol = 1, scales = "free") +
    scale_colour_manual(values = cbpalette) +
    theme_minimal() +
    labs(x = "Day", y = "", col = "Path number", title = "Baseline behaviour")

The key takeaway for me is that even though the noise is IID and Gaussian, taking the cumulative sum and exponentiating can lead to large deviations in concentration. For example, the maximum concentration value observed was \(5.46 \times 10^{5}\) – roughly 5460 times greater than the baseline concentration.

Exponential

For the exponential regime, we also suppose a lognormal geometric random walk \[ c^{(e)}_t = c^{(e)}_{t - 1} \times \exp \left( \epsilon^{(e)}_t \right) \] where \(\epsilon^{(e)}_t \sim \mathcal{N}(r, \sigma_e^2)\) and the growth rate \(r > 0\). The median concentration at time \(t\) is \(c^{(e)}_{0} \exp(rt)\); the mean picks up an additional lognormal factor \(\exp(t \sigma_e^2 / 2)\), as in the baseline case. Let’s simulate some paths from this distribution as before.

exponential_df <- data.frame(
  day = rep(1:time_window, times = n_baseline_paths), 
  epsilon = rnorm(time_window * n_baseline_paths, mean = r),
  path_number = as.factor(rep(1:n_baseline_paths, each = time_window))
  ) %>%
  group_by(path_number) %>%
  arrange(path_number) %>%
  mutate(
    cumsum_epsilon = cumsum(epsilon),
    conc = exp_initial * exp(cumsum_epsilon)
  ) %>%
  ungroup()

exponential_df %>%
  pivot_longer(
    cols = c("epsilon", "cumsum_epsilon", "conc"),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable = recode_factor(variable, 
      "epsilon" = "epsilon", 
      "cumsum_epsilon" = "cumsum(epsilon)",
      "conc" = "Concentration")
  ) %>%
  ggplot(aes(x = day, y = value, group = path_number, col = path_number)) +
    geom_line() +
    facet_wrap(~variable, ncol = 1, scales = "free") +
    scale_colour_manual(values = cbpalette) +
    theme_minimal() +
    labs(x = "Day", y = "", col = "Path number", title = "Exponential behaviour")

Sequencing observation models

See Townes (2020) for a nice review of models for count data.

Dirichlet-multinomial model

The Dirichlet is a probability distribution over proportions. \(w \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)\) if \[ p(w; \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} w_1^{\alpha_1 - 1} w_2^{\alpha_2 - 1} \times \cdots \times w_K^{\alpha_K - 1} \]

dirichlet_draw <- gtools::rdirichlet(n = 1, alpha = K_norm[1, ])
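The draw above gives proportions only; a full Dirichlet-multinomial observation then uses those proportions in a multinomial sample. A sketch, using base R gamma draws in place of gtools::rdirichlet (the two constructions are equivalent) and a hypothetical precision parameter phi, since using K_norm[1, ] directly as alpha implies a total concentration of 1 and hence very strong overdispersion:

```r
set.seed(1)

read_depth <- 1e4
n_total_kmer <- 600
phi <- 100                                # hypothetical precision: larger phi, closer to plain multinomial
p <- rep(1 / n_total_kmer, n_total_kmer)  # stand-in for K_norm[1, ]

# Dirichlet(phi * p) draw via normalised independent gammas
g <- rgamma(n_total_kmer, shape = phi * p)
w <- g / sum(g)

# Multinomial counts given the sampled proportions
y <- rmultinom(1, size = read_depth, prob = w)
sum(y)  # always equals the read depth
```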

Poisson-lognormal model

The multivariate Poisson-lognormal model (MPLN) (Chiquet, Mariadassou, and Robin 2021) is set up for analysis of an abundance table, an \(n \times p\) matrix \(Y\) where \(Y_{ij}\) is the number of individuals from species \(j \leq p\) observed at site \(i \leq n\). “Species” could refer to an operational taxonomic unit (OTU) or amplicon sequence variant (ASV), “site” could refer to a sample or experiment, and “number of individuals” could refer to number of reads. One model is \[ Y_{ij} \, | \, Z_{ij} \sim \text{Poisson}(\exp(o_{ij} + Z_{ij})), \\ Z_i \sim \mathcal{N}(\mu_i, \Sigma), \] where the \(Z_i\) are assumed to be independent across sites. Some moments of \(Y\) are given by: \[ \mathbb{E}(Y_{ij}) = \exp(o_{ij} + \mu_{ij} + \sigma_{jj} / 2) > 0 \\ \mathbb{V}(Y_{ij}) = \mathbb{E}(Y_{ij}) + \mathbb{E}(Y_{ij})^2 (\exp(\sigma_{jj}) - 1) > \mathbb{E}(Y_{ij}) \\ \mathbb{Cov}(Y_{ij}, Y_{ik}) = \mathbb{E}(Y_{ij}) \mathbb{E}(Y_{ik}) (\exp(\sigma_{jk}) - 1). \] Note: 1. Because of the logarithmic link, the expected count exceeds \(\exp(o_{ij} + \mu_{ij})\): the latent variance contributes the extra factor \(\exp(\sigma_{jj}/2)\). 2. The model displays over-dispersion compared with a Poisson model, where the variance equals the mean, because of the latent noise. (I’m still interested in how this differs from e.g. a negative-binomial model, though I prefer to use random effects for noise as they are easier to modify and extend to include structure, say [amongst other reasons], than likelihoods.) 3. Correlation is “faithful” in that \(\mathbb{Cov}(Y_{ij}, Y_{ik})\) has the same sign as \(\sigma_{jk}\), and if \(\sigma_{jk} = 0\) then \(\mathbb{Cov}(Y_{ij}, Y_{ik}) = 0\).
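To make the model concrete, here is a minimal single-site simulation; a sketch with hypothetical values for \(o\), \(\mu\), and \(\Sigma\), taking \(\Sigma\) diagonal so that base rnorm suffices:

```r
set.seed(1)

p <- 5                          # number of species
o <- rep(log(1e4), p)           # offsets, e.g. log sequencing depth (assumed)
mu <- rnorm(p, mean = -7)       # latent mean log-abundances (assumed values)
sigma2 <- rep(0.5, p)           # diagonal of Sigma (assumed; no cross-correlation)

# Z_i ~ N(mu, Sigma); Y_ij | Z_ij ~ Poisson(exp(o_ij + Z_ij))
Z <- rnorm(p, mean = mu, sd = sqrt(sigma2))
Y <- rpois(p, lambda = exp(o + Z))

# The marginal mean includes the lognormal correction sigma_jj / 2
expected_Y <- exp(o + mu + sigma2 / 2)
rbind(Y = Y, expected = round(expected_Y, 1))
```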

Bibliography

Chiquet, Julien, Mahendra Mariadassou, and Stéphane Robin. 2021. “The Poisson-Lognormal Model as a Versatile Framework for the Joint Analysis of Species Abundances.” Frontiers in Ecology and Evolution 9: 588292.
Townes, F William. 2020. “Review of Probability Distributions for Modeling Count Data.” arXiv Preprint arXiv:2001.04343.

  1. The units don’t matter much.